7 research outputs found

    Evaluating cognitive load of text-to-speech synthesis

    This thesis addresses the evaluation of synthetic speech and its impact on the end-user, taking into consideration potential negative implications for cognitive load. While conventional methods like transcription tests and Mean Opinion Score (MOS) tests offer a valuable overall understanding of system performance, they fail to provide deeper insights into the reasons behind the performance. As text-to-speech (TTS) systems are increasingly used in real-world applications, it becomes crucial to explore whether synthetic speech imposes a greater cognitive load on listeners than human speech, as excessive cognitive effort could lead to fatigue over time. The study focuses on assessing the cognitive load of synthetic speech by presenting two methodologies: the dual-task paradigm and pupillometry. The dual-task paradigm initially seemed promising but was eventually deemed unreliable and unsuitable due to uncertainties in the experimental setups that require further investigation. However, pupillometry emerged as a viable approach, demonstrating its efficacy in detecting differences in cognitive load among various speech synthesizers. Notably, the research confirmed that accurate measurement of listening difficulty requires imposing sufficient cognitive load on listeners. To achieve this, the most viable experimental setup involved measuring the pupil response while listening to speech in the presence of noise. These experiments revealed contrasts between human and synthetic speech. Human speech consistently demanded the least cognitive load. On the other hand, state-of-the-art TTS systems showed promising results, indicating a significant improvement in their cognitive load performance compared to rule-based synthesizers of the past. Pupillometry offers a deeper understanding of the factors contributing to increased cognitive load in synthetic speech processing.
In particular, an experiment highlighted that the separate modeling of spectral feature prediction and duration in TTS systems led to heightened cognitive load. Encouragingly, many modern end-to-end TTS systems have addressed these issues by predicting acoustic features within a unified framework, thus effectively reducing the overall cognitive load imposed by synthetic speech. As the gap between human and synthetic speech diminishes with advancements in TTS technology, continuous evaluation using pupillometry remains essential for optimizing TTS systems for low cognitive load. Although pupillometry demands advanced analysis techniques and is time-consuming, the meaningful insights it provides into the cognitive load of synthetic speech contribute to an enhanced user experience and better TTS system development. Overall, this work establishes pupillometry as a viable and effective method for measuring the cognitive load of synthetic speech, moving synthetic speech evaluation beyond traditional metrics. By gaining a deeper understanding of synthetic speech's interaction with the human cognitive processing system, researchers and developers can work towards creating TTS systems that offer improved user experiences with reduced cognitive load, ultimately enhancing the overall usability and acceptance of such technologies. Note: There was a 2-year break in the work reported in this thesis: an initial pilot was performed in early 2020 and was then suspended due to the COVID-19 pandemic. Experiments were therefore rerun in 2022/23 with the most recent state-of-the-art models so that we could determine whether the increased cognitive load result is still applicable. The thesis thus concludes by addressing whether the cognitive load methods it develops are still useful, practical and/or relevant for current state-of-the-art text-to-speech systems.

    Evaluating Cognitive Load of Text-To-Speech (TTS) synthesis

    Current evaluation methods for text-to-speech (TTS) synthesis rely solely on subjective rating scores. These tests typically account mostly for how natural or intelligible the voice is. With state-of-the-art systems, these measures are approaching ceiling and therefore alternative measures such as the cognitive load may become more meaningful. To our knowledge, there is little or no recent work evaluating the cognitive load of state-of-the-art text-to-speech systems. We use pupillometry as a measure of cognitive load. The pupil has been found to dilate upon increased cognitive effort when carrying out a listening task. Currently we are evaluating speech generated by a Deep Neural Network TTS synthesiser. In our method, we generate stimuli that step incrementally from natural speech to synthesized speech by changing only a single feature at a time. Stimuli are presented to listeners in speech-shaped noise conditions.

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects. Comment: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.10111

    ASVspoof 2019: a large-scale public database of synthetized, converted and replayed speech

    Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as “presentation attacks.” These vulnerabilities are generally unacceptable and call for spoofing countermeasures or “presentation attack detection” systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof challenge initiative was created to foster research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures. The first edition, ASVspoof 2015, focused upon the study of countermeasures for detecting text-to-speech synthesis (TTS) and voice conversion (VC) attacks. The second edition, ASVspoof 2017, focused instead upon replay spoofing attacks and countermeasures. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access.
It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona fide utterances even by human subjects. It is expected that the ASVspoof 2019 database, with its varied coverage of different types of spoofing data, could further foster research on anti-spoofing. Peer reviewed